library(tidyverse)
library(infer)
library(mosaic)
library(caret)
library(ggplot2)
library(mdsr)
library(rpart)
library(rpart.plot)
library(partykit)
library(randomForest)
library(pROC)
library(gbm)
library(Metrics)
library(viridis)
require(maps)
require(viridis)
theme_set(
theme_void()
)
library(knitr)
library(tmap)
library(ggpubr)
The Premise
Hello! This project is the final for my STAT228: Introduction to Data
Science course at Simmons University. With this project, I’d like to
encompass everything I’ve learned throughout the semester, as well as
some additional data science methods that I have taught myself outside
of class. (The primary non-curriculum methods I am using come from the
package tmap, which I was made aware of from a LinkedIn
post by Joachim Schork, a data science educator & consultant from
Germany). The premise of my project is to predict & analyze Women’s
Empowerment Index scores for countries ; in this project, I aim to find
the best version of the model predicting WEI scores using a variety of
ensemble methods. There are two datasets I’m interested in using
here:
I’d like to join the two datasets by the common variable “Country”,
and analyze WEI scores by health factors related to the patient’s
country. I am interested in creating several maps that will visualize
WEI scores against other health-based factors.
Cleaning the Data
I want to start by filtering led to only include data where
the year = 2015, because this is the most recent year within this
dataset.
led_2015 <- led%>%
filter(Year=="2015")
Now, I want to rename several datapoints within the led_2015
dataset because I want the data names to be congruent between
led_2015 and wei. This will be very tedious, but it is
necessary to be able to join the datasets!
led_2015[led_2015$Country == "Bolivia(PlurinationalStateof)", "Country"] <- "Bolivia"
wei[wei$Country == "Bolivia(Plurinational State of)", "Country"] <- "Bolivia"
led_2015[led_2015$Country == "BosniaandHerzegovina", "Country"] <- "Bosnia and Herzegovina"
led_2015[led_2015$Country == "BurkinaFaso", "Country"] <- "Burkina Faso"
wei[wei$Country == "Congo (Democratic Republic of the)", "Country"] <- "DemocraticRepublicoftheCongo"
led_2015[led_2015$Country == "CostaRica", "Country"] <- "Costa Rica"
led_2015[led_2015$Country == "DominicanRepublic", "Country"] <- "Dominican Republic"
led_2015[led_2015$Country == "ElSalvador", "Country"] <- "El Salvador"
led_2015[led_2015$Country == "Iran(IslamicRepublicof)", "Country"] <- "Iran (Islamic Republic of)"
led_2015[led_2015$Country == "LaoPeople'sDemocraticRepublic", "Country"] <- "Laos"
wei[wei$Country == "Lao People's Democratic Republic", "Country"] <- "Laos"
led_2015[led_2015$Country == "RepublicofMoldova", "Country"] <- "Moldova (Republic of)"
led_2015[led_2015$Country == "SierraLeone", "Country"] <- "Sierra Leone"
led_2015[led_2015$Country == "SouthAfrica", "Country"] <- "South Africa"
led_2015[led_2015$Country == "SriLanka", "Country"] <- "Sri Lanka"
led_2015[led_2015$Country == "TheformerYugoslavrepublicofMacedonia", "Country"] <- "North Macedonia"
led_2015[led_2015$Country == "UnitedRepublicofTanzania", "Country"] <- "Tanzania (United Republic of)"
wei[wei$Country == "Türkiye", "Country"] <- "Turkey"
led_2015[led_2015$Country == "UnitedArabEmirates", "Country"] <- "United Arab Emirates"
led_2015[led_2015$Country == "UnitedKingdomofGreatBritainandNorthernIreland", "Country"] <- "United Kingdom"
led_2015[led_2015$Country == "UnitedStatesofAmerica", "Country"] <- "United States"
led_2015[led_2015$Country == "VietNam", "Country"] <- "Viet Nam"
That took a WHILE! But, now my two datasets should be better matched,
and it’ll be much easier to join them!
Now I’m going to remove some unnecessary variables from the datasets
This is because some variables have too many missing datapoints to be
useful, or because some variables may be redundant.
led_2015 = subset(led_2015, select = -c(Alcohol, Totalexpenditure,Year))
Now, I will be joining the two datasets using inner_join, because I
only want the countries in common between both dataframes. I’m also
going to create a second version that is full_joined, just in case I end
up needing it!
dataset <- wei%>% inner_join(led_2015,by="Country")
dataset2 <- wei%>% full_join(led_2015,by="Country")
dim(dataset)
## [1] 112 25
Finally, I will be renaming a few variables because I dislike having
spaces in my variable names. After this, I will have finished cleaning
the data! I’m also removing a few more variables just because they seem
rather redundant - for example, the Gender Parity Group and Gender
Parity Index variables may be too similar to WEI, therefore they might
have skewed levels of correlation.
dataset <- dataset %>%
rename(WEI = "Women's Empowerment Index (WEI) - 2022")
dataset <- dataset %>%
rename(WEG = "Women's Empowerment Group - 2022")
dataset <- dataset %>%
rename(GGPI = "Global Gender Parity Index (GGPI) - 2022")
dataset <- dataset %>%
rename(HDG = "Human Development Group - 2021")
dataset <- dataset %>%
rename(GPG = "Gender Parity Group - 2022")
dataset <- dataset %>%
rename(SDG_Region = "Sustainable Development Goal regions")
dataset = subset(dataset, select = -c(GPG, SDG_Region,GGPI))
All finished cleaning! Now I want to take a peek at the dataset I’ll
be working with before I go into the next steps.
dim(dataset)
## [1] 112 22
summary(dataset$WEI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1410 0.5180 0.6150 0.6085 0.7070 0.8280
tally(dataset$Status)
## X
## Developed Developing
## 31 81
Intial Visualization
Now, I want to create a map of WEI by country, just to get a general
visual of how the distribution looks.
I’m going to start out by creating a joined dataset between
dataset and World, so I can have a mappable dataset
using the tmap package.
library(tmap)
data("World")
dataset<- dataset %>%
rename(name = Country)
wei.map <- inner_join(dataset,World,by="name")
Unfortunately, the wei.map only contains 98 individuals
despite dataset containing 112 individuals! This means that 14
individuals have disharmonious names between dataset and
World. I’m going to run some code renaming a whole bunch of
countries, just like I did above with the led_2015 dataset. I’m
going to hide this code, just because it’s super tedious work. Tidying
& cleaning your data is a constant process!
dataset[dataset$name == "Bosnia and Herzegovina", "name"] <- "Bosnia and Herz."
dataset[dataset$name == "DemocraticRepublicoftheCongo", "name"] <- "Dem. Rep. Congo"
dataset[dataset$name == "Dominican Republic", "name"] <- "Dominican Rep."
dataset[dataset$name == "Iran (Islamic Republic of)", "name"] <- "Iran"
dataset[dataset$name == "Laos", "name"] <- "Lao PDR"
dataset[dataset$name == "North Macedonia", "name"] <- "Macedonia"
dataset[dataset$name == "Moldova (Republic of)", "name"] <- "Moldova"
dataset[dataset$name == "Tanzania (United Republic of)", "name"] <- "Tanzania"
dataset[dataset$name == "Viet Nam", "name"] <- "Vietnam"
Now, let’s try re-joining these datasets!
wei.map <- right_join(dataset,World,by="name")
library(s2)
library(sf)
## Warning: package 'sf' was built under R version 4.3.2
wei.map.sf <- wei.map %>%
st_as_sf()
And visualizing:
tmap_mode("plot")
tm_shape(wei.map.sf) +
tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)
Lovely! I also made an interactive version of the map, just for further
exploration of the tmap package.
tmap_mode("view")
tm_shape(wei.map.sf) +
tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)
Now, I want to move onto creating models that will help me predict
the WEI scores of countries based on other factors.
Decision Tree
I’m going to start by creating a decision tree for my
dataset. I believe this will be beneficial to see, seeing as
the decision tree will select the best predictors for my response
variable. The first step here is to split my data into training and
testing data.
set.seed(228)
train.index <- createDataPartition(dataset$WEI, p= 0.8, list = FALSE)
train <- dataset[train.index,]
test <- dataset[-train.index,]
I’m going to look at the proportion of developed vs developing
countries within both my test and train data. This is
so I can ensure that test is at least somewhat representative
of the distribution within train.
tally(test$Status, format='proportion')
## X
## Developed Developing
## 0.3 0.7
tally(train$Status, format = 'proportion')
## X
## Developed Developing
## 0.2717391 0.7282609
tally(dataset$Status, format = 'proportion')
## X
## Developed Developing
## 0.2767857 0.7232143
Both test and train appear to be adequately
representative of dataset!
Now, I will create my decision tree. I will fit the decision tree to
train, and I will set nearly every predictor variable (BESIDES
NAME & WEG, for reasons of redundancy) as my X. I am doing this so
that the decision tree can tell me which predictor variables have the
highest influence on my response variable, WEI.
tree <- rpart(WEI~ HDG + Status + Lifeexpectancy + AdultMortality + infantdeaths + HepatitisB + Measles + Polio + Diphtheria + Population + Schooling, data=train, cp=.01)
rpart.plot(tree)

It appears there are three predictors that have the highest
correlation with WEI scores : HDG (Human Development Group), Years of
Schooling, and Adult Mortality Rates. For further exploration, I’m going
to create a Random Forest to find our best model.
Random Forest & RMSE
I’m going to have to not use any variables that have “NA” values to
be able to properly crearte a random forest. This unfortunately means
removing Schooling. However, I don’t want to remove Schooling because
according to the decision tree, it’s one of the most important
predictors! So, I am going to use regression imputation to create some
estimated Schooling values for any of our countries that have Schooling
= NA.
missing <- is.na(dataset$Schooling)
sum(missing)
## [1] 6
which(missing)
## [1] 11 30 31 47 72 105
lm(Schooling~WEI, data=dataset)
##
## Call:
## lm(formula = Schooling ~ WEI, data = dataset)
##
## Coefficients:
## (Intercept) WEI
## 2.574 17.943
dataset$Schooling[11]<-2.574+17.943*(0.707)
dataset$Schooling[30]<-2.574+17.943*(0.778)
dataset$Schooling[31]<-2.574+17.943*(0.752)
dataset$Schooling[47]<-2.574+17.943*(0.686)
dataset$Schooling[72]<-2.574+17.943*(0.399)
dataset$Schooling[105]<-2.574+17.943*(0.510)
Yet another example of never being done with cleaning the data! Now,
I have to recreate my training and testing dataset without these missing
values.
set.seed(228)
train.index <- createDataPartition(dataset$WEI, p= 0.8, list = FALSE)
train <- dataset[train.index,]
test <- dataset[-train.index,]
Then, I can finally create my Random Forest
set.seed(228)
rf_model <-randomForest(WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles + Polio + Diphtheria ,data=train, proximity=TRUE, ntree=1000)
rf_model
##
## Call:
## randomForest(formula = WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles + Polio + Diphtheria, data = train, proximity = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.005114937
## % Var explained: 72
The mean of squared residuals is pretty close to 0, which means the
model is good! However, I’m going to also try running a Random Forest
model in which I only use the three best predictor variables as
identified by my Decision Tree. I will then compare the two mean of
squared residuals.
set.seed(228)
rf_model2 <-randomForest(WEI ~ HDG + Schooling + AdultMortality ,data=train, proximity=TRUE, ntree=1000)
rf_model2
##
## Call:
## randomForest(formula = WEI ~ HDG + Schooling + AdultMortality, data = train, proximity = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.005898256
## % Var explained: 67.72
The mean of squared residuals is ever so slightly higher for this
model, and the Var explained is slightly lower. I’m going to use an RMSE
test to compare the two models.
Now, I’m going to boost the two models. Before doing so, I am
changing HDG’s class to ordered, and Status’s class to ordered, so they
arecompatible with the boosted model.
train$HDG <- as.ordered(train$HDG)
train$Status <- as.ordered(train$Status)
boost1 <- gbm(WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles + Polio + Diphtheria, data=train,distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, n.minobsinnode=5)
boost2 <- gbm(WEI~ Schooling + HDG + AdultMortality , data = train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, n.minobsinnode=5)
Now I will evaluate the two models using predictions & RMSE.
predictions1 <- predict(boost1,newdata=test)
rmse(actual=test$WEI, predicted=predictions1)
## [1] 0.0567137
predictions2 <- predict(boost2, newdata = test)
rmse(actual=test$WEI, predicted=predictions2)
## [1] 0.05870663
Once again, the model with nearly very variable has proved to perform
slightly better than the model with only 3 predictor variables.
However, this difference is so small, and I think it is
worth it to sacrifice this small difference for the sake of keeping our
model SIMPLE! The model still performs very well. So, I’m deciding to
keep the model that only contains the three best predictor
variables!
Final Visualizations
wei.map <- right_join(dataset,World,by="name")
wei.map.sf <- wei.map %>%
st_as_sf()
I’m going to create FOUR maps here: One exactly like my first
visualization that shows WEI scores by country, and then three that
shows the distribution of each of my predictor variables! I will display
all of these maps side by side.
tmap_mode("plot")
tm_shape(wei.map.sf) + tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("HDG",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Human Development Groups", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("Schooling",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Years of Schooling", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("AdultMortality",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Adult Mortality per Population of 1000", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

These are sort of confusing. For one, the HDG categories are out of
order, which is very hard to read. Secondly, the adult mortality map
should probably have the colors flipped, because lower adult mortality
corresponds to higher WEI.
I’m going to fix them, and insert the png below.
knitr::include_graphics("/Users/jasmineday/Downloads/WEIPLOT.jpg")

And there we have our final visualization! I hope you enjoyed reading
through this project as much as I enjoyed my STAT228 class :)
BONUS: Testing the Prediction Model
After finishing the first draft of my project, I figured I wanted to
add a section in which I take a small handful of countries that are NOT
represented in the wei dataset and use our model to predict the
WEI scores for these nations. I will be using the countries Kazakhstan,
New Zealand, Sudan, Haiti, and Ukraine. I will use led_2015 to
source the HDG, Schooling, and AdultMortality for each of these
countries.
finalmodel <- lm(WEI~HDG+Schooling+AdultMortality,data=dataset)
finalmodel
##
## Call:
## lm(formula = WEI ~ HDG + Schooling + AdultMortality, data = dataset)
##
## Coefficients:
## (Intercept) HDGLow HDGMedium HDGVery high Schooling
## 0.1862461 -0.0749196 -0.0202205 0.0717149 0.0282162
## AdultMortality
## 0.0002026
#Kazakhstan:
#HDG=Very high
#Schooling=15
#AdultMortality=198
Kazakhstan_WEI <- .0186 + .0717 + .0282*15 + .0002*198
#New Zealand
#HDG=Very high
#Schooling=19.2
#AdultMortality=66
NZ_WEI <- .0186 + .0717 + .0282*19.2 + .0002*66
#Sudan
#HDG=Low
#Schooling=7.2
#AdultMortality=225
Sudan_WEI <- .0186 - .0749 + .0282*7.2 + .0002*225
#Haiti
#HDG=Medium
#Schooling=9.1
#AdultMortality=24
Haiti_WEI <- .0186 - .0202 + .0282*9.1 + .0002*24
#Ukraine
#HDG=High
#Schooling=15.3
#AdultMortality=195
Ukraine_WEI <- .0186 + .0282*15.3 + .0002*195
Kazakhstan imputed/predicted WEI score = .553 New Zealand
imputed/predicted WEI score = .645 Sudan imputed/predicted WEI score =
.192 Haiti imputed/predicted WEI score = .260 Ukraine imputed/predicted
WEI score = .489
These don’t feel entirely accurate. For example, Sudan’s imputed WEI
score is just above that of Yemen’s, which is the lowest in the world.
However, it still can be beneficial to see where there may be potential
flaws in using this linear model to predict WEI scores! WEI score data
on all countries is not public, so I am unfortunately unable to
check the actual accuracy of these predictions.
---
title: "Finding the Best Model for Women's Empowerment Index Scores"
author: "Jasmine Day"
date: "May 6th 2024"
output: 
  html_document:
    code_folding: hide
    theme: flatly
    toc: true
    toc_float: true
    code_download: true
---

```{r, warning = FALSE, message = FALSE}
library(tidyverse)
library(infer)
library(mosaic)
library(caret)
library(ggplot2)
library(mdsr)
library(rpart)
library(rpart.plot)
library(partykit)
library(randomForest)
library(pROC)
library(gbm)
library(Metrics)
library(viridis)
require(maps)
require(viridis)
theme_set(
  theme_void()
  )
library(knitr)
library(tmap)
library(ggpubr)

```

```{r, echo = FALSE, warning = FALSE, message = FALSE}
library(readr)

wei <- read_csv("DATASETS/women_empowerment_index.csv")

led <- read_csv("DATASETS/led.csv")

```


# The Premise
Hello! This project is the final for my STAT228: Introduction to Data Science course at Simmons University. With this project, I'd like to encompass everything I've learned throughout the semester, as well as some additional data science methods that I have taught myself outside of class. (The primary non-curriculum methods I am using come from the package **tmap**, which I was made aware of from a LinkedIn post by Joachim Schork, a data science educator & consultant from Germany). The premise of my project is to predict & analyze Women's Empowerment Index scores for countries ; in this project, I aim to find the best version of the model predicting WEI scores using a variety of ensemble methods. There are two datasets I'm interested in using here: 

- 1. the *wei* dataset, which stands for Women's Empowerment Index, is the first dataset I'm going to be using. It is sourced from Human Development Reports, and can be accessed at the link below. (**https://www.kaggle.com/datasets/iamsouravbanerjee/women-empowerment-index**). It contains information on the WEI scores for 114 countries. 
- 2. The second dataset I'm interested in using is the *led* dataset, which stands for Life Exptectancy Dataset, and is sourced from the World Health Organization. This dataset can be accessed at the link below. (**https://www.kaggle.com/datasets/augustus0498/life-expectancy-who/data**). This dataset contains important health-centered data on every country in the world.

I'd like to join the two datasets by the common variable "Country", and analyze WEI scores by health factors related to the patient's country. I am interested in creating several maps that will visualize WEI scores against other health-based factors. 

# Cleaning the Data
I want to start by filtering *led* to only include data where the year = 2015, because this is the most recent year within this dataset.
```{r, class.source = "fold-show"}

led_2015 <- 	led%>%
	filter(Year=="2015")

```

Now, I want to rename several datapoints within the *led_2015* dataset because I want the data names to be congruent between *led_2015* and *wei*. This will be very tedious, but it is necessary to be able to join the datasets! 
```{r}

led_2015[led_2015$Country == "Bolivia(PlurinationalStateof)", "Country"] <- "Bolivia"

wei[wei$Country == "Bolivia(Plurinational State of)", "Country"] <- "Bolivia"

led_2015[led_2015$Country == "BosniaandHerzegovina", "Country"] <- "Bosnia and Herzegovina"

led_2015[led_2015$Country == "BurkinaFaso", "Country"] <- "Burkina Faso"

wei[wei$Country == "Congo (Democratic Republic of the)", "Country"] <- "DemocraticRepublicoftheCongo"

led_2015[led_2015$Country == "CostaRica", "Country"] <- "Costa Rica"

led_2015[led_2015$Country == "DominicanRepublic", "Country"] <- "Dominican Republic"

led_2015[led_2015$Country == "ElSalvador", "Country"] <- "El Salvador"

led_2015[led_2015$Country == "Iran(IslamicRepublicof)", "Country"] <- "Iran (Islamic Republic of)"

led_2015[led_2015$Country == "LaoPeople'sDemocraticRepublic", "Country"] <- "Laos"

wei[wei$Country == "Lao People's Democratic Republic", "Country"] <- "Laos"

led_2015[led_2015$Country == "RepublicofMoldova", "Country"] <- "Moldova (Republic of)"

led_2015[led_2015$Country == "SierraLeone", "Country"] <- "Sierra Leone"

led_2015[led_2015$Country == "SouthAfrica", "Country"] <- "South Africa"

led_2015[led_2015$Country == "SriLanka", "Country"] <- "Sri Lanka"

led_2015[led_2015$Country == "TheformerYugoslavrepublicofMacedonia", "Country"] <- "North Macedonia"

led_2015[led_2015$Country == "UnitedRepublicofTanzania", "Country"] <- "Tanzania (United Republic of)"

wei[wei$Country == "Türkiye", "Country"] <- "Turkey"

led_2015[led_2015$Country == "UnitedArabEmirates", "Country"] <- "United Arab Emirates"

led_2015[led_2015$Country == "UnitedKingdomofGreatBritainandNorthernIreland", "Country"] <- "United Kingdom"

led_2015[led_2015$Country == "UnitedStatesofAmerica", "Country"] <- "United States"

led_2015[led_2015$Country == "VietNam", "Country"] <- "Viet Nam"

```
That took a WHILE! But, now my two datasets should be better matched, and it'll be much easier to join them! 

Now I'm going to remove some unnecessary variables from the datasets This is because some variables have too many missing datapoints to be useful, or because some variables may be redundant. 
```{r, class.source = "fold-show"}

led_2015 = subset(led_2015, select = -c(Alcohol, Totalexpenditure,Year))

```

Now, I will be joining the two datasets using inner_join, because I only want the countries in common between both dataframes.  I'm also going to create a second version that is full_joined, just in case I end up needing it!
```{r, class.source = "fold-show"}
dataset <- wei%>% inner_join(led_2015,by="Country")

dataset2 <- wei%>% full_join(led_2015,by="Country")
dim(dataset)
```

Finally, I will be renaming a few variables because I dislike having spaces in my variable names. After this, I will have finished cleaning the data! I'm also removing a few more variables just because they seem rather redundant - for example, the Gender Parity Group and Gender Parity Index variables may be too similar to WEI, therefore they might have skewed levels of correlation.
```{r, class.source = "fold-show"}
dataset <- dataset %>%
  rename(WEI = "Women's Empowerment Index (WEI) - 2022")

dataset <- dataset %>%
  rename(WEG = "Women's Empowerment Group - 2022")

dataset <- dataset %>%
  rename(GGPI = "Global Gender Parity Index (GGPI) - 2022")

dataset <- dataset %>%
  rename(HDG = "Human Development Group - 2021")

dataset <- dataset %>%
  rename(GPG = "Gender Parity Group - 2022")

dataset <- dataset %>%
  rename(SDG_Region = "Sustainable Development Goal regions")
```

```{r, class.source = "fold-show"}
dataset = subset(dataset, select = -c(GPG, SDG_Region,GGPI))
```
All finished cleaning! Now I want to take a peek at the dataset I'll be working with before I go into the next steps. 
```{r, class.source = "fold-show"}
dim(dataset)
summary(dataset$WEI)
tally(dataset$Status)
```
# Intial Visualization 
Now, I want to create a map of WEI by country, just to get a general visual of how the distribution looks. 

I'm going to start out by creating a joined dataset between *dataset* and *World*, so I can have a mappable dataset using the **tmap** package. 
```{r, class.source = "fold-show"}
library(tmap)
data("World")

dataset<- dataset %>%
  rename(name = Country)

wei.map <- inner_join(dataset,World,by="name")

```
Unfortunately, the *wei.map* only contains 98 individuals despite *dataset* containing 112 individuals! This means that 14 individuals have disharmonious names between *dataset* and *World*. I'm going to run some code renaming a whole bunch of countries, just like I did above with the *led_2015* dataset. I'm going to hide this code, just because it's super tedious work. Tidying & cleaning your data is a constant process!
```{r}
dataset[dataset$name == "Bosnia and Herzegovina", "name"] <- "Bosnia and Herz."

dataset[dataset$name == "DemocraticRepublicoftheCongo", "name"] <- "Dem. Rep. Congo"

dataset[dataset$name == "Dominican Republic", "name"] <- "Dominican Rep."

dataset[dataset$name == "Iran (Islamic Republic of)", "name"] <- "Iran"

dataset[dataset$name == "Laos", "name"] <- "Lao PDR"

dataset[dataset$name == "North Macedonia", "name"] <- "Macedonia"

dataset[dataset$name == "Moldova (Republic of)", "name"] <- "Moldova"

dataset[dataset$name == "Tanzania (United Republic of)", "name"] <- "Tanzania"

dataset[dataset$name == "Viet Nam", "name"] <- "Vietnam"
```
Now, let's try re-joining these datasets!
```{r, class.source = "fold-show",message=FALSE}
wei.map <- right_join(dataset,World,by="name")
library(s2)
library(sf)
wei.map.sf  <- wei.map %>% 
  st_as_sf()
```
And visualizing:
```{r, message=FALSE}

tmap_mode("plot")
tm_shape(wei.map.sf) +
    tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

```
Lovely! I also made an interactive version of the map, just for further exploration of the tmap package.

```{r, message=FALSE}

tmap_mode("view")
tm_shape(wei.map.sf) +
    tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

```
Now, I want to move onto creating models that will help me predict the WEI scores of countries based on other factors.

# Decision Tree
I'm going to start by creating a decision tree for my *dataset*. I believe this will be beneficial to see, seeing as the decision tree will select the best predictors for my response variable. The first step here is to split my data into training and testing data.
```{r, class.source = "fold-show",message=FALSE}

set.seed(228)
train.index <- createDataPartition(dataset$WEI, p= 0.8, list = FALSE)
train <- dataset[train.index,]
test <- dataset[-train.index,]
```
I'm going to look at the proportion of developed vs developing countries within both my *test* and *train* data. This is so I can ensure that *test* is at least somewhat representative of the distribution within *train*.
```{r, class.source = "fold-show",message=FALSE}
tally(test$Status, format='proportion')
tally(train$Status, format = 'proportion')
tally(dataset$Status, format = 'proportion')
```
Both *test* and *train* appear to be adequately representative of *dataset*!

Now, I will create my decision tree. I will fit the decision tree to *train*, and I will set nearly every predictor variable (BESIDES NAME & WEG, for reasons of redundancy) as my X. I am doing this so that the decision tree can tell me which predictor variables have the highest influence on my response variable, WEI.
```{r, class.source = "fold-show",message=FALSE}
tree <- rpart(WEI~ HDG + Status + Lifeexpectancy + AdultMortality + infantdeaths + HepatitisB + Measles +  Polio + Diphtheria + Population + Schooling, data=train, cp=.01)

rpart.plot(tree)
```

It appears there are three predictors that have the highest correlation with WEI scores : HDG (Human Development Group), Years of Schooling, and Adult Mortality Rates. For further exploration, I'm going to create a Random Forest to find our best model.

# Random Forest & RMSE
I'm going to have to not use any variables that have "NA" values to be able to properly crearte a random forest. This unfortunately means removing Schooling. However, I don't want to remove Schooling because according to the decision tree, it's one of the most important predictors! So, I am going to use regression imputation to create some estimated Schooling values for any of our countries that have Schooling = NA. 
```{r, class.source = "fold-show", message=FALSE}
missing <- is.na(dataset$Schooling) 

sum(missing)

which(missing)

lm(Schooling~WEI, data=dataset)


dataset$Schooling[11]<-2.574+17.943*(0.707)
dataset$Schooling[30]<-2.574+17.943*(0.778)
dataset$Schooling[31]<-2.574+17.943*(0.752)
dataset$Schooling[47]<-2.574+17.943*(0.686)
dataset$Schooling[72]<-2.574+17.943*(0.399)
dataset$Schooling[105]<-2.574+17.943*(0.510)
```
Yet another example of never being done with cleaning the data! Now, I have to recreate my training and testing dataset without these missing values. 
```{r, class.source = "fold-show", message=FALSE}
set.seed(228)
train.index <- createDataPartition(dataset$WEI, p= 0.8, list = FALSE)
train <- dataset[train.index,]
test <- dataset[-train.index,]
```

Then, I can finally create my Random Forest
```{r, class.source = "fold-show",message=FALSE}
set.seed(228)
rf_model <-randomForest(WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles +  Polio + Diphtheria ,data=train, proximity=TRUE, ntree=1000)
rf_model

```
The mean of squared residuals is pretty close to 0, which means the model is good! However, I'm going to also try running a Random Forest model in which I only use the three best predictor variables as identified by my Decision Tree. I will then compare the two mean of squared residuals.
```{r, class.source = "fold-show",message=FALSE}
set.seed(228)
rf_model2 <-randomForest(WEI ~ HDG +  Schooling + AdultMortality  ,data=train, proximity=TRUE, ntree=1000)
rf_model2

```
The mean of squared residuals is ever so slightly higher for this model, and the Var explained is slightly lower. I'm going to use an RMSE test to compare the two models.


Now, I'm going to boost the two models. Before doing so, I am changing HDG's class to ordered, and Status's class to ordered, so they arecompatible with the boosted model.
```{r, class.source = "fold-show",message=FALSE}
train$HDG <- as.ordered(train$HDG)
train$Status <- as.ordered(train$Status)
boost1 <- gbm(WEI ~ HDG + Status + Lifeexpectancy + Schooling + AdultMortality + infantdeaths + Measles +  Polio + Diphtheria, data=train,distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, n.minobsinnode=5)
boost2 <- gbm(WEI~ Schooling + HDG + AdultMortality , data = train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, n.minobsinnode=5)

```
Now I will evaluate the two models using predictions & RMSE.
```{r, class.source = "fold-show", message=FALSE,warning=FALSE}
predictions1 <- predict(boost1,newdata=test)
rmse(actual=test$WEI, predicted=predictions1)

predictions2 <- predict(boost2, newdata = test)
	rmse(actual=test$WEI, predicted=predictions2)
```
Once again, the model with nearly very variable has proved to perform *slightly* better than the model with only 3 predictor variables. However, this difference is **so small**, and I think it is worth it to sacrifice this small difference for the sake of keeping our model SIMPLE! The model still performs very well. So, I'm deciding to keep the model that only contains the three best predictor variables! 

# Final Visualizations
```{r, message=FALSE,warning=FALSE}
wei.map <- right_join(dataset,World,by="name")

wei.map.sf  <- wei.map %>% 
  st_as_sf()
```
I'm going to create FOUR maps here: One exactly like my first visualization that shows WEI scores by country, and then three that shows the distribution of each of my predictor variables! I will display all of these maps side by side.
```{r, class.source = "fold-show", message=FALSE,warning=FALSE}
tmap_mode("plot")

 tm_shape(wei.map.sf) + tm_polygons("WEI",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Women's Empowerment Index Scores", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("HDG",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Human Development Groups", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("Schooling",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Plot of Years of Schooling", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)

tm_shape(wei.map.sf) + tm_polygons("AdultMortality",palette="magma",n=8) +
tm_layout(inner.margins = c(.05, .05, .05, .05), sepia.intensity=.01, frame.double.line=1, title = "Adult Mortality per Population of 1000", title.position = c("right", "bottom"), title.fontface="bold",title.fontfamily="Times", legend.title.fontfamily = "Times",legend.title.fontface="bold", title.bg.color="lightsalmon", title.size=1.8, legend.position=c("left","center"), saturation = .9, title.bg.alpha=.7)
```

These are sort of confusing. For one, the HDG categories are out of order, which is very hard to read. Secondly, the adult mortality map should probably have the colors flipped, because lower adult mortality corresponds to *higher* WEI.

I'm going to fix them, and insert the png below.

```{r}
knitr::include_graphics("/Users/jasmineday/Downloads/WEIPLOT.jpg")
```

 And there we have our final visualization!  I hope you enjoyed reading through this project as much as I enjoyed my STAT228 class :)
 
# BONUS: Testing the Prediction Model
After finishing the first draft of my project, I figured I wanted to add a section in which I take a small handful of countries that are NOT represented in the *wei* dataset and use our model to predict the WEI scores for these nations. I will be using the countries Kazakhstan, New Zealand, Sudan, Haiti, and Ukraine. I will use *led_2015* to source the HDG, Schooling, and AdultMortality for each of these countries.

```{r, class.source = "fold-show", message=FALSE,warning=FALSE}
finalmodel <- lm(WEI~HDG+Schooling+AdultMortality,data=dataset)
finalmodel
#Kazakhstan:
  #HDG=Very high
  #Schooling=15
  #AdultMortality=198
Kazakhstan_WEI <- .0186 + .0717 + .0282*15 + .0002*198
#New Zealand  
  #HDG=Very high
  #Schooling=19.2
  #AdultMortality=66
NZ_WEI <- .0186 + .0717 + .0282*19.2 + .0002*66
#Sudan
  #HDG=Low
  #Schooling=7.2
  #AdultMortality=225
Sudan_WEI <- .0186 - .0749 + .0282*7.2 + .0002*225
#Haiti
  #HDG=Medium
  #Schooling=9.1
  #AdultMortality=24
Haiti_WEI <- .0186 - .0202 + .0282*9.1 + .0002*24
#Ukraine
  #HDG=High
  #Schooling=15.3
  #AdultMortality=195
Ukraine_WEI <- .0186 + .0282*15.3 + .0002*195
```

Kazakhstan imputed/predicted WEI score = .553
New Zealand imputed/predicted WEI score = .645
Sudan imputed/predicted WEI score = .192
Haiti imputed/predicted WEI score = .260
Ukraine imputed/predicted WEI score = .489

These don't feel entirely accurate. For example, Sudan's imputed WEI score is just above that of Yemen's, which is the lowest in the world. However, it still can be beneficial to see where there may be potential flaws in using this linear model to predict WEI scores! WEI score data on *all* countries is not public, so I am unfortunately unable to check the actual accuracy of these predictions. 
